44 research outputs found

    HeAT -- a Distributed and GPU-accelerated Tensor Framework for Data Analytics

    To cope with the rapid growth in available data, the efficiency of data analysis and machine learning libraries has recently received increased attention. Although great advancements have been made in traditional array-based computations, most are limited by the resources available on a single computation node. Consequently, novel approaches are needed to exploit distributed resources, e.g. distributed memory architectures. To this end, we introduce HeAT, an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API. HeAT utilizes PyTorch as a node-local eager execution engine and distributes the workload on arbitrarily large high-performance computing systems via MPI. It provides both low-level array computations and assorted higher-level algorithms. With HeAT, it is possible for a NumPy user to take full advantage of their available resources, significantly lowering the barrier to distributed data analysis. When compared to similar frameworks, HeAT achieves speedups of up to two orders of magnitude. Comment: 10 pages, 8 figures, 5 listings, 1 table

    HeAT – a Distributed and GPU-accelerated Tensor Framework for Data Analytics

    In order to cope with the exponential growth in available data, the efficiency of data analysis and machine learning libraries has recently received increased attention. Although corresponding array-based numerical kernels have been significantly improved, most are limited by the resources available on a single computational node. Consequently, kernels must exploit distributed resources, e.g., distributed memory architectures. To this end, we introduce HeAT, an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API. HeAT utilizes PyTorch as a node-local eager execution engine and distributes the workload via MPI on arbitrarily large high-performance computing systems. It provides both low-level array-based computations and assorted higher-level algorithms. With HeAT, it is possible for a NumPy user to take advantage of their available resources, significantly lowering the barrier to distributed data analysis. Compared with applications written in similar frameworks, HeAT achieves speedups of up to two orders of magnitude.

    The Helmholtz Analytics Toolkit (Heat) and its role in the landscape of massively-parallel scientific Python

    When it comes to enhancing exploitation of massive data, machine learning methods are at the forefront of researchers’ awareness. Much less so is the need for, and the complexity of, applying these techniques efficiently across large-scale, memory-distributed data volumes. In fact, these aspects, typical of the handling of massive data sets, pose major challenges to the vast majority of research communities, in particular to those without a background in high-performance computing. Often, the standard approach involves breaking up and analyzing data in smaller chunks; this can be inefficient and prone to errors, and sometimes it is not appropriate at all because the context of the overall data set can get lost. The Helmholtz Analytics Toolkit (Heat) library offers a solution to this problem by providing memory-distributed and hardware-accelerated array manipulation, data analytics, and machine learning algorithms in Python. The main objective is to make memory-intensive data analysis possible across various fields of research, in particular for domain scientists who are not experts in traditional high-performance computing but nevertheless need to tackle data analytics problems that go beyond the capabilities of a single workstation. The development of this interdisciplinary, general-purpose, and open-source scientific Python library started in 2018 and is based on a collaboration of three institutions of the Helmholtz Association (German Aerospace Center DLR, Forschungszentrum Jülich FZJ, Karlsruhe Institute of Technology KIT). The pillars of its development are:
    - to enable memory distribution of n-dimensional arrays,
    - to adopt PyTorch as the process-local compute engine (hence supporting GPU acceleration),
    - to provide memory-distributed (i.e., multi-node, multi-GPU) array operations and algorithms, optimizing asynchronous MPI communication (based on mpi4py) under the hood, and
    - to wrap functionalities in a NumPy- or scikit-learn-like API so that existing applications can be ported with minimal changes and the library can be used by non-experts in HPC.
    In this talk we will give an illustrative overview of the current features and capabilities of our library. Moreover, we will discuss its role in the existing ecosystem of distributed computing in Python, and we will address technical and operational challenges in further development.
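The idea behind memory-distributed arrays can be sketched in a few lines of plain Python. This is only a conceptual illustration, not Heat's implementation or API: no Heat, PyTorch, or MPI is used, the "processes" are simulated in a single loop, and the helper names `split_rows` and `distributed_mean` are hypothetical. It shows an array split along its first axis, a node-local reduction on each chunk, and a final combination step that stands in for an MPI collective.

```python
# Conceptual sketch only: plain Python, with no Heat, PyTorch, or MPI involved.
# `split_rows` and `distributed_mean` are hypothetical helper names.

def split_rows(rows, n_procs):
    """Partition rows into n_procs contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(rows), n_procs)
    chunks, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks


def distributed_mean(rows, n_procs):
    """Column-wise mean assembled from per-chunk partial sums."""
    chunks = split_rows(rows, n_procs)
    n_cols = len(rows[0])
    # Each simulated process reduces only its own chunk (the node-local step).
    partials = [
        ([sum(r[c] for r in chunk) for c in range(n_cols)], len(chunk))
        for chunk in chunks
    ]
    # Combining the partial results stands in for a collective such as an all-reduce.
    total = [sum(p[0][c] for p in partials) for c in range(n_cols)]
    count = sum(p[1] for p in partials)
    return [t / count for t in total]


data = [[float(i), float(2 * i)] for i in range(10)]
result = distributed_mean(data, n_procs=3)
```

The result is the same as a single-process mean over all rows, which is the sense in which such operations can stay transparent to the user regardless of how the data is distributed.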

    The Helmholtz Analytics Toolkit (HeAT) - A Scientific Big Data Library for HPC -

    This talk presents the Helmholtz Analytics Toolkit (HeAT), an HPC data analytics library for scientific applications. HeAT builds on top of PyTorch, which provides many required features such as automatic differentiation, CPU and GPU support, linear algebra operations and basic MPI functionalities. However, distributed computations must be designed by hand for each basic communication, and furthermore PyTorch implements only a subset of MPI functionalities. HeAT starts at this point, providing a distributed tensor data object on which operations can be performed. The tensor data objects reside either on the CPU or on the GPU and, if desired, are distributed over various nodes. Operations on tensor objects are transparent to the user, i.e. they remain the same irrespective of whether the HeAT data object resides on a single node or is distributed over several nodes. On the basis of this core structure, HeAT implements typical data analytics methods motivated by various scientific use cases. After motivating the framework and specifying its scope, the talk describes its concept and its realization in detail. The presentation demonstrates the usage of HeAT by means of several typical examples from data analytics. The presentation closes with a discussion of the downsides, further developments and future challenges of HeAT.

    The Edge Preserving Wiener Filter for Scalar and Tensor Valued Images

    This contribution presents a variation of the Wiener filter criterion, i.e. minimizing the mean squared error, by combining it with the main principle of normalized convolution, i.e. the introduction of prior information into the filter process via the certainty map. Thus, we are able to optimize a filter according to the signal and noise characteristics while preserving edges in images. Despite its low computational cost, the proposed filter scheme outperforms state-of-the-art filter methods that also work in the spatial domain. Furthermore, the Wiener filter paradigm is extended from scalar-valued data to tensor-valued data.
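As a rough illustration of the certainty-map mechanism mentioned in this abstract, the sketch below implements plain zeroth-order normalized convolution on a 1-D signal. It is not the proposed edge-preserving Wiener filter: the kernel here is fixed by hand rather than optimized for a mean squared error criterion, and the function name, kernel, and sample values are assumptions chosen for the example.

```python
# Sketch of zeroth-order normalized convolution only; the kernel is NOT
# optimized in the Wiener (minimum mean squared error) sense.

def normalized_convolution(signal, certainty, kernel):
    """Certainty-weighted local average of a 1-D signal.

    out[i] = sum_k a[k] * c[j] * f[j] / sum_k a[k] * c[j],  with j = i + k - half
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        num = den = 0.0
        for k, a in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):  # samples outside the signal count as zero certainty
                num += a * certainty[j] * signal[j]
                den += a * certainty[j]
        # If no neighbor carries any certainty, fall back to 0.0.
        out.append(num / den if den > 0 else 0.0)
    return out


signal = [1.0, 1.0, 9.0, 1.0, 1.0]      # one corrupted sample at index 2
certainty = [1.0, 1.0, 0.0, 1.0, 1.0]   # the corrupted sample gets zero certainty
kernel = [1.0, 2.0, 1.0]
smoothed = normalized_convolution(signal, certainty, kernel)
```

Because the corrupted sample has zero certainty, it contributes to neither the numerator nor the denominator, so every output value is 1.0. Setting low certainty for samples across an edge limits smoothing over that edge in the same way.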